Machine Learning in Big Data Analysis
Introduction
Every day, enormous volumes of data are generated by applications and devices across many sectors, and this data is growing at an astonishing speed across the globe. Conventional data processing applications are unable to process and analyze data at this scale, so machine learning methods are applied to learn hidden patterns, trends, and associations in the data.
Big Data Analytics
There is growing research interest in big data analytics, which deals with the collection, storage, and analysis of huge datasets to find hidden patterns and other key information. It helps produce useful knowledge that supports business decisions, and it has been applied successfully in many sectors, including social media, the economy, finance, healthcare, and agriculture. The term big data describes the large volume of structured and unstructured data that inundates a business on a day-to-day basis. These datasets arrive from sources such as sensors, transactional applications, the web, and social media, and traditional databases cannot handle them because of their volume and complexity.
Machine Learning
Machine learning is an interdisciplinary area that is now used across various industries and applications due to the increase in computing power, data collection, and storage capabilities. Machine learning gives systems the ability to learn without being explicitly programmed, and the main focus of machine learning research is to build efficient learning algorithms that can
(i) process and analyze the data,
(ii) understand its details and trends, and
(iii) make predictions on data.
Machine learning is divided into three main categories:
(a) supervised learning - the model is trained on labelled data, where each training example consists of an input value and a target output value. The supervised algorithm analyzes the training data, learns from it, and then classifies new data or predicts outcomes.
(b) unsupervised learning - the model is trained on raw, unlabelled data, and hidden insights are drawn from the unlabelled datasets.
(c) reinforcement learning - the machine learns its behavior from the feedback it receives through interactions with the external environment.
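The difference between the first two categories can be illustrated with a minimal sketch. It assumes one-dimensional toy data and two deliberately simple methods chosen only for illustration: a nearest-centroid classifier for the supervised case and a basic k-means loop for the unsupervised case. The function names and the data are hypothetical, and reinforcement learning is left out because it needs an environment to interact with.

```python
from statistics import mean

def fit_nearest_centroid(points, labels):
    """Supervised: learn one centroid per class from labelled 1-D points."""
    groups = {}
    for x, y in zip(points, labels):
        groups.setdefault(y, []).append(x)
    return {label: mean(xs) for label, xs in groups.items()}

def predict(centroids, x):
    """Classify a new point by its closest class centroid."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

def k_means_1d(points, centers, iterations=10):
    """Unsupervised: group unlabelled 1-D points around the given centers."""
    centers = list(centers)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for x in points:
            nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            clusters[nearest].append(x)
        # Keep the old center if a cluster ends up empty.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers

# Toy data (assumed for illustration): two well-separated groups.
data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
labels = ["low", "low", "low", "high", "high", "high"]

# Supervised: labels are available during training.
centroids = fit_nearest_centroid(data, labels)
prediction = predict(centroids, 4.5)

# Unsupervised: only the raw points are used, no labels.
found_centers = sorted(k_means_1d(data, centers=[0.0, 6.0]))
```

The supervised model needs the `labels` list to learn anything, whereas the k-means loop recovers roughly the same two groups from the raw points alone.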
Nowadays, machine learning algorithms are widely used by data scientists to analyze datasets and extract concealed insights from them. However, most traditional machine learning algorithms were developed for datasets that fit completely into memory [1], and they were not originally designed for big datasets. Hence, the computational cost of these algorithms increases with the data size, which ultimately makes big data analytics extremely slow or impractical. There is therefore a clear need for effective techniques and algorithms for big data analytics, and over the last decade researchers have designed many machine learning methods to solve big data analytics problems.
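One common way around the in-memory limitation is to process the data incrementally, so that only a small summary is kept in memory at any time. The sketch below is an illustration under that assumption rather than a method taken from [1]: it uses Welford's online algorithm to compute a mean and variance one value at a time. The helper `read_in_chunks` is hypothetical and stands in for reading a large file piece by piece; here it simply slices a small list.

```python
def read_in_chunks(values, chunk_size):
    """Yield successive chunks; a stand-in for reading a large file in pieces."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]

class RunningStats:
    """Welford's online algorithm: mean and variance in constant memory."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new value into the running summary."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self._m2 / self.count if self.count else 0.0

stats = RunningStats()
# Toy input (assumed): the integers 1..10, consumed three at a time.
for chunk in read_in_chunks(list(range(1, 11)), chunk_size=3):
    for value in chunk:
        stats.update(value)
```

Because each value is folded into three numbers and then discarded, the memory used does not grow with the size of the dataset.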
First, data is extracted from multiple sources, and the collected data is processed by a machine learning model to find hidden patterns and predict (or classify) outcomes. The model then gives recommendations to the system. The process ends if a recommendation is accepted; otherwise, the model processes the data again until it finds a satisfactory result.
Literature Review
The authors of [3], [4], and [5] proposed several machine learning algorithms for big data analysis that consider feature selection, feature extraction, and distance metric learning. They also show that the performance of the algorithms is influenced by the selection of features. Feature selection finds the most important features of the data for model creation, and feature extraction transforms high-dimensional data into a low-dimensional space. A distance metric measures the distance between points of a dataset. In [6], the authors proposed distributed machine learning to solve the scalability issue. Here, datasets are distributed among several workstations, and learning is carried out on these partitions to scale up the learning process.
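The idea of learning over data spread across several workstations can be sketched as follows. This is a minimal illustration, not the specific methods surveyed in [6]: it assumes a simple least-squares line fit, simulates the workstations as plain Python lists in a single process, and uses hypothetical function names. Each partition is reduced to five numbers locally, and only those summaries are combined.

```python
def local_statistics(partition):
    """Run on one workstation: summarise its (x, y) pairs as five numbers."""
    n = len(partition)
    sum_x = sum(x for x, _ in partition)
    sum_y = sum(y for _, y in partition)
    sum_xy = sum(x * y for x, y in partition)
    sum_xx = sum(x * x for x, _ in partition)
    return n, sum_x, sum_y, sum_xy, sum_xx

def merge(stats_list):
    """Combine the per-workstation summaries by element-wise addition."""
    return tuple(sum(values) for values in zip(*stats_list))

def fit_line(stats):
    """Least-squares slope and intercept computed from the merged summaries."""
    n, sum_x, sum_y, sum_xy, sum_xx = stats
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept

# Toy data (assumed): points on the line y = 2x + 1, split over three "workstations".
partitions = [
    [(0, 1), (1, 3)],
    [(2, 5), (3, 7)],
    [(4, 9)],
]
slope, intercept = fit_line(merge([local_statistics(p) for p in partitions]))
```

For this model the merged result is the same as fitting on the full dataset, because the least-squares solution depends only on sums that can be accumulated independently on each partition. Many other algorithms do not decompose this cleanly and need approximate or iterative schemes instead.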
In general, the training data and the test data are taken from the same domain, but in some cases collecting training data from that domain is difficult and expensive. Transfer learning, another machine learning approach, is widely used to solve this problem. In [7], the authors address several transfer learning scenarios. Learning from unlabelled data is also difficult, and active learning is used to address this problem. The authors of [8] proposed an active learning model for labelling unlabelled data.
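As an illustration of how active learning decides which unlabelled examples are worth labelling, the sketch below uses uncertainty sampling, a common textbook strategy. It is not the pairwise label homogeneity query approach of [8]. The one-dimensional logistic model, its parameters `weight` and `bias`, and the pool of points are all assumptions made for the example.

```python
import math

def predict_probability(weight, bias, x):
    """Probability of the positive class under a 1-D logistic model."""
    return 1.0 / (1.0 + math.exp(-(weight * x + bias)))

def most_uncertain(pool, weight, bias):
    """Return the unlabelled point whose predicted probability is closest to 0.5."""
    return min(pool, key=lambda x: abs(predict_probability(weight, bias, x) - 0.5))

# Hypothetical current model: its decision boundary lies at x = 3.
weight, bias = 1.0, -3.0

# Hypothetical pool of unlabelled points.
unlabelled_pool = [0.5, 2.8, 6.0, 9.0]

# The point to send to a human annotator next.
query = most_uncertain(unlabelled_pool, weight, bias)
```

The selected point is the one nearest the decision boundary, where a label is expected to be most informative. After it is labelled, the model would be retrained and the query step repeated.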
The authors of [9] highlighted frequently used machine learning algorithms for big data analytics. Several algorithms, namely decision trees, Bayesian algorithms, support vector machines, artificial neural networks, and K-means, were discussed in this work. Several frameworks, namely MapReduce frameworks (Apache Hadoop and Spark), Google's TensorFlow, and Microsoft's Azure ML, were also summarized.
Researchers continue to design new machine learning approaches and techniques in the domain of big data analytics to solve real-world problems.
Conclusion
The growth of big data makes it very difficult to handle complex datasets using traditional learning algorithms. Therefore, efficient machine learning models are required to process, manage, and analyze huge volumes of heterogeneous data. The outputs generated by these models provide effective solutions to real-world problems in various sectors, such as healthcare, agriculture, social media, and banking. In the future, the use of machine learning in big data analysis is expected to rise because of increasing demand in different fields.
References:
[1] W. Chen and X. Lin, “Big Data Deep Learning: Challenges and Perspectives”, in IEEE Access, vol. 2, pp. 514-525, 2014.
[2] Rajendran, P. Sharma, N. K. Saran, S. Ray, J. Alanya-Beltran and K. Tongkachok, “An Exploratory analysis of Machine Learning adaptability in Big Data Analytics Environments: A Data Aggregation in the age of Big Data and the Internet of Things”, Proceedings of the 2nd International Conference on Innovative Practices in Technology and Management (ICIPTM)/IEEE, Gautam Buddha Nagar, India, pp. 32-36, 2022.
[3] Qiu, Q. Wu, G. Ding, Y. Xu and S. Feng, “A survey of machine learning for big data processing”, EURASIP Journal on Advances in Signal Processing, Springer, vol. 2016:67, pp. 1-16, 2016.
[4] Tu and S. Sun, “Cross-domain representation-learning framework with combination of class-separate and domain-merge objectives”, Proceedings of the 1st International Workshop on Cross Domain Knowledge Discovery in Web and Social Network Mining/ACM, Beijing, China, pp. 18-25, 2012.
[5] Bengio, A. Courville and P. Vincent, “Representation Learning: A Review and New Perspectives”, in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, issue 8, pp. 1798-1828, 2013.
[6] D. Peteiro-Barral and B. Guijarro-Berdinas, “A survey of methods for distributed machine learning”, Progress in Artificial Intelligence, Springer, vol. 2, issue 1, pp. 1-11, 2013.
[7] K. Weiss, T. Khoshgoftaar and D. Wang, “A survey of transfer learning”, Journal of Big Data, Springer, vol. 3, issue 9, pp. 1-40, 2016.
[8] Fu, B. Li, X. Zhu and C. Zhang, “Active Learning without Knowing Individual Instance Labels: A Pairwise Label Homogeneity Query Approach”, in IEEE Transactions on Knowledge and Data Engineering, vol. 26, issue 4, pp. 808-822, 2014.
[9] L. Berral-Garcia, “A quick view on current techniques and machine learning algorithms for big data analytics”, 18th International Conf. on Transparent Optical Networks, pp. 1-4, 2016.
Written by Dr. Kalyan Baital, Scientist-D, NIELIT Kolkata